Sparse recovery with unknown variance: 
a LASSO-type approach 
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Abstract 

We address the issue of estimating tlie regression vector /3 in the generic s-sparse linear model y = 
xp + z, with p G W, y e R", z ~ J\f{0,a^I) and p > n when the variance cr^ is unknown. We study 
two LASSO-type methods that jointly estimate p and the variance. These estimators are minimizers of the 
^1 penalized least-squares functional, where the relaxation parameter is tuned according to two different 
■ strategies. In the first strategy the relaxation parameter is of the order ay/log p, where a'^ is the empirical 
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variance. In the second strategy the relaxation parameter is chosen so as to enforce a trade-off between 
Q ■ the fidelity and the penalty terms at optimality For both estimators, our assumptions are similar to the ones 

. proposed by Candes and Plan in Ann. Stat. (2009), for the case where ct^ is known. We prove that our 

estimators ensure exact recovery of the support and sign pattern of p with high probability We present 
simulations results showing that the first estimator enjoys nearly the same performances in practice as the 
standard LASSO (known variance case) for a wide range of the signal to noise ratio. Our second estimator 
f-H . is shown to outperform both in terms of false detection, when the signal to noise ratio is low. 
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1 Introduction 
1.1 Problem statement 

■ The well-known standard Gaussian linear model reads 

y = XP + z, (1.1) 

^ . where X denotes a n x p design matrix, /3 G is an unknown parameter and the components 
of the error z are assumed i.i.d. with normal distribution A/'(0,(t^). The present paper aims at 
^ I studying this model in the case where the number of covariates is greater than the number of 
5^ ■ observations, n < p, and the regression vector /3 and the variance are both unknown. 

The estimation of the parameters in this case is of course impossible without further assump- 
tions on the regression vector (3. One such assumption is sparsity, i.e. only a few components of 
/3 are different from zero, say s components; (3 is then said to be s-sparse. There has been a great 
interest in the study of this problem recently. Recovering the support of (3 has been extensively 
studied in the context of Compressed Sensing, a new paradigm for designing observation matri- 
ces X. In this framework, it is now a standard fact that matrices X can be found (e.g. with high 
probability if drawn from sub-Gaussian i.i.d distributions) such that the number of observations 
needed to recover /3 exactly is proportional to s log (p/n). 
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1 .2 Existing results in the known variance case 

When the variance is known and positive, two popular techniques to estimate the regression 
vector (3 are the Least Absolute Shrinkage and Selection Operator (LASSO) [33], and the Dantzig 
selector [10]. We refer to [2] for a recent simultaneous analysis of these two methods. The 
standard LASSO estimator (5x of (3 is defined as 

e argmin ^||7/-X6||^ + A||6||i, (L2) 

where A > is a regularization parameter controlling the sparsity of the estimated coefficients. 

Sparse recovery cannot hold without some geometric assumptions on the dictionary (or the 
design matrix), as recalled in [22] pp. 4-5. The papers [37] and [38] introduced very pertinent 
assumptions for the study of variable selection problem using the LASSO in the finite sample 
(resp. asymptotic) contexts. 

One common assumption for the precise study of the statistical performance of these estimators 
is an incoherence property of the matrix X. This means that the coherence of X, i.e. the maximum 
scalar product of two (normalized) columns of X, is very small. Coherence based conditions 
appeared first in the context of Basis Pursuit for sparse approximation in [13], [17] and [14]. It 
then had a significant impact on Compressed Sensing; see [29] and [9]. 

The recent references [2], [5] and [21] contain interesting assumptions on the coherence in our 
context of interest, i.e. high dimensional sparse regression. For instance, [2] and [5] require a 
bound of the order ^J\ogn/n whereas [21] requires a bound of the order The recent paper 
[8] requires that the coherence of X is less than Cst/logp. Under the additional assumptions 
that ^ is sparse and assuming that the support and sign pattern are tmiformly distributed, they 
prove that ^ has the same support and sign pattern as ^ with probability 1 — p~^((27rlogp)"^/^ + 

Notice, as commented on in e.g. [8], that various assumptions in the literature, such as the 
invertibility of the restricted covariance matrix [37] indexed by the signal's true support and 
the Irrepresentable Condition in [38] can be derived from their incoherence condition, although 
with possibly suboptimal orders in certain instances. 

1 .3 Existing results in the unknown variance case 

The problem of estimating the variance in the sparse regression model has been addressed in 
only a few references until now. In [1] the authors analyze in the unknown variance setting AIC, 
BIC and AMDL based estimators, as well as estimators using a more general complexity penalty. 
As well known among practitioners, the LASSO procedure, at the price of certain assumptions 
on X, avoids the enumeration of all subsets of covariates, an intractable task when the number of 
covariates is large. This last property motivates the theoretical analysis provided in the present 
paper. 

In a recent work [4], a joint estimation procedure for both the regression vector and the variance 
is proposed. The authors give a detailed study of the risk under quite general conditions. In 
[31], it is proven in particular that, for the variance estimator of [4], under a compatibility 
condition introduced in [36], A||/3||i/(T = o(l) if and only if a/a = (1 + op(l)), for A such that 
P(A > a\\X^{Y — XP)/n\\oo/(T) 1 where a > 1 is any constant. The problem of support and sign 
pattern recovery as well as the one of providing non-asymptotic results with explicit constants 
are not addressed. 

1.4 Our contribution 

We study two different strategies in the present paper. 
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1.4. 1 Strategy (A): Plugging in ttie variance estimator 

Our work mainly aims at understanding when the results of [8] extend to the case where 
(T^ is unknown. In the case where is known, it is proven in [8] that the right order of 
magnitude for A is a^/\ogp. We first study the very natural estimator consisting of replacing 
cr by a = \\y — X/3\\2/\/n in the expression of A. As is standard in the study of the LASSO, the 
regression vector ^'s coefficients have to be significantly larger than the noise level for exact 
recovery of the support and sign pattern. 

The main differences between the known and the unknown variance cases are summarized 
in the following table. 



Known variance 


Unknown variance: Strategy (A) 


argmin "^"f + A||6||i 


^Aeargmin "^"f + A||6||i 


A = est ay/log p 


Tune A to A s.t. : A = C^ar^^/logp 
with : = "^-^^"^"^ 

n 


Convex problem 


Non convex problem 


Oraele /3 


Oracle 0,X) 


Conditions holding with 
high probability 


Similar conditions 



Notice that, in this table, (3 is defined via A and A is defined via (3. In other words, f3 and 
A jointly satisfy a set of optimality conditions. From a numerical viewpoint, /3 and A can be 
computed iteratively using a fixed point-type algorithm; see Section 5.3.1. 

1.4.2 Strategy (B): Enforcing a trade-off between fidelity and penalty 

Another possible strategy can be used to overcome the problem of estimating the regression 
vector /3 and the relaxation parameter A when the variance cr^ is unknown. This strategy consists 
of prescribing a trade-off between the fidelity term and the penalty term. More precisely, we 
will impose the constraint A||^||i/||y — X/3\\l = C. Enforcing such a trade-off between fidelity 
and penalty results in a more complex problem from both the statistical and the computationaly 
viewpoints. However, since A||/3||i and \\y — X(3\\l are, at least approximately, homogeneous 
functions of a^, using such a criterion allows to bypass the estimation of the variance in a first 

stage. The variance itself could be estimated in a second stage, using the formula = \\y_J^£xh^ 
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Known variance 


Unknown variance: Strategy (B) 


argmin "^"f'^^ + A||6||i 


e argmin "^"f + A ||6||i 


A = est ay/log p 


A||%||i = C||y-X^5^||i 


Convex problem 


Non convex problem 


wracie p 


wr dcie yp, A) 


Conditions holding with 
high probability 


Similar conditions 
+ Upper bound on ||^||i 



1.4.3 Results 

Our main results are Theorem 2.5, for Strategy (A), and Theorem 2.7, for Strategy (B). Both results 
can be described as follows. Given an arbitrary a > Q, we prove that, for regression vectors jS 
satisfying certain natural constraints, standard assumptions on the number of observations n 
and ihe sparsity s imply that our modified LASSO procedures fail to identify the support and 
the signs of /3 with probability at most of the order p""". These results are non-asymptotic and 
all our constants are explicit. 

The coherence assumption on the design matrix made in this paper is readily checkable. Many 
other currently used assumptions in the literature are based on concentration properties of the 
extreme singular values of all or most extracted submatrices of X with bounded number of 
columns. Yet, some other are based on the concentration of the singular values of the covariance 
matrix with respect to the covariate's underlying distribution. All such criteria are difficult or 
impossible to check in practice as opposed to the coherence property. 

We neither make any uncheckable assumption on the variance a^. The only unverifiable as- 
sumptions used in the present work are on the magnitude of the nonzero regression coefficients. 
As in [8], the set of regressors P which are correctly estimated is constrained by imposing that 
the magnitude of all nonzero components of P should be greater than the noise level. Moreover, 
for Strategy (B), our analysis requires the additional assumption that the components of P should 
not be too large either, the upper bound being in particular a function of C. This result suggests 
that Strategy (B) is pertinent in low SNR situations only. Simulation experiments at the end of 
this paper confirm the usefulness of Strategy (B) in the low SNR setting. 

1 .5 Plan of the paper 

The LASSO estimator, the main results Theorem 2.5 and Theorem 2.7, together with the assump- 
tions used throughout the paper are presented in Section 2. The proof of Theorem 2.5 is given in 
Section 3 and the proof of Theorem 2.7 in Section 4. The proofs of certain technical intermediate 
results are gathered in the Appendix. 
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1 .6 Notations 

1.6.1 Generalities 

When c {1, . . . , A^}, we denote by \E\ the cardinal of E. For / c {1,. . . ,p} and x a vector in 
W, we set ,x/ = {xi)i^i G M'^L The usual scalar product is denoted by (■, ■). The notations for the 
norms on vectors and matrices are also standard: for any vector x = (xi) e M^, 

II l|2 — 2 . II II _ I I . II II _ II 

II •^112 — / ^ -^i ) 11*^111 — / ^ l-^il ) ll-^lloo sup \Xi\. 

l<i<N l<i<N ^^^^^ 

For any matrix A E R'^^^'^^^ -^^e denote by A* its transpose. The set of symmetric real matrices 
in M"^" is denoted by S„. We denote by \\A\\ the operator norm of A. The maximum (resp. 
minimum) singular value of A is denoted by crmax(^) (resp. aram{A)). Recall that crmax(^) = ||^|| 
and, if A is invertible, auiin{A)~^ — We use the Loewner ordering on symmetric real 

matrices: if A g §„, ^ A is equivalent to saying that A is positive semi-definite, and A ^ B 
stands for ^ - A. 

The notations A/^(/i,cr^) (resp. x^{^) arid stands for the normal distribution on the real 

line with mean // and variance (resp. the Chi-square distribution with u degrees of freedom 
and the Bernoulli distribution with parameter u). 

1.6.2 Specific notations related to the design matrix X and the estimators 

For / c {1, . . . ,p}, and a matrix X, we denote by Xj the submatrix whose columns are indexed 
by /. We denote the range of Xi by Vj and the orthogonal projection onto Vj by Pvr 
The coherence /x(X) of a matrix X whose columns are unit-norm is defined by 

= max \{Xi,Xj)\. (1.3) 

As in [34], we consider the 'hollow-gram' matrix H and the selector matrix R — diag((5): 

H = X*X-Id (1.4) 
R = diag((5), (1.5) 

where 5 is a vector of length p whose components are i.i.d. random variables following the 
Bernoulli distribution B{s/p). In a similar fashion, we define Rs = diag{S^^^) where 5^^^ is a ran- 
dom vector of length p, uniformly distributed on the set of all vectors with exactly s components 
equal to 1 and p — s components equal to 0. 
The support of /5 is always denoted by T. 



2 The modified LASSO estimators 

In this section, we present the main results on the estimators given by Strategy (A) and Strategy 
(B), and we discuss the underlying assumptions. Practical computability of these estimators will 
be studied in Section 5. In particular "tuning A to A" is achieved by finding a zero of a function of 
A numerically. We will show in Section 5 that these zero finding problems are computationally 
very easy to solve. 

For any arbitrary value of a > 0, Theorem 2.5 (resp. Theorem 2.7), proposes a set of conditions 
under which exact recovery of the support and sign pattern of ^ holds with probability at least 
1 — 0(p~") for Strategy (A) (resp. for Strategy (B)). 

As will be shortly seen, the magnitude of the nonzero coefficients of f3 has to satisfy certain 
constraints: as in [8], one will require for both Strategies that the nonzero components of ^ 
are not too small (in fact, slightly above the noise level). In the case of Strategy (B), we will 
moreover require that the nonzero components of /3 are not too large. Although this upper 
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bound assumption may seem to argue in disfavor of Strategy (B), computational experiments 

will later show that this Strategy has much nicer empirical performance when the signal to 
noise ratio is small. The same computational experiments will also demonstrate that Strategy 
(A) performs almost as well as a standard LASSO which would know the variance. 



2.1 Definition of the estimators 

To define our estimators, we first need to work with matrices ensuring that the map A i-^ Px, 
where is given by (1.2), is well defined and enjoys special properties, such as continuity. 

Definition 2.1: The matrix X is said to satisfy the Generic Condition if 

\{xj,Xi{XjXi)-^Si)\ < 1, V5 e {-1, ly, V/ c {1, . . . ,p} s.t. Xj non singular and Vj ^ /. (2.6) 

As from now, we always work under the Generic Condition. We will use the following result 
about uniqueness of the LASSO estimator. 

Proposition 2.2 ([15]): Assume that X satisfies the Generic Condition. Then, for all y e R", and 
for all A e R+, Problem (1.2) has a unique solution Px and its support Tx is such that Xf^ is non 
singular. 

The following property is proven in Appendix C.2.1: 

Lemma 2.3: Let the Generic Condition hold. Then, almost surely, the map 

(0,+oo) — > W 



A ^ Px 

is bounded and continuous. Moreover, its £i-norm is non-increasing. 
2.1.1 Strategy A 

The estimator of strategy A is defined as /3 := where A verifies the implicit equation 



log p. (2.7) 



n 

The estimators A) being implicitly defined, it is not clear, at that point, that they exist. 

We will see in the sequel that a suitable choice of Cyar will ensure the existence of the estimators 
(under the above mentioned assumptions on X). 
The uniqueness follows by showing that the map : M+ — )■ M+ given by 

r.W := ^- 

logp ll!/ - A>/j.J|i 

is increasing, which is proven in Appendix C.2. 

Strategy A simply reduces to finding the value A^ such that Va{^a) = C'var- A precise range of 
interest for Cyar will be given in Theorem 2.5 below. Moreover, using the existence and uniqueness 
result, one can use a fixed point scheme to find A. This scheme is discussed in Section 5.3.1. 

Remark 2.4: Recall that in the known variance case, it is often assumed that 

X' = Cvar<7'l0gp, (2.8) 
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for some positive constant C^-,; see e.g. in [8]. In comparison, Strategy (A) enforces the choice (2.7). This 
is the empirical analog to (2.8). However, as will appear later in the proof of Theorem 2.5, instead of being 
an absolute constant, Cyar will have to depend on n, p and as follows 

P 

In the case of an i.i.d. Gaussian random design matrix, is of the order p/n with high probability. 
Thus Cvar can be basically seen as a constant in the Gaussian setting. 

2.1.2 Strategy B 

The estimator of strategy B is defined as ^ :— where A verifies the implicit equation 



A||/3a||i = C 



2 



2 



(2.9) 



Again, the estimators {P,X) are implicitly defined and their existence has to be proven. 



Compared to Strategy A, one specificity of Strategy B is that for any value of C > 0, existence 
and uniqueness of the estimators is garanteed, with no other assumptions than the Generic 
Condition. Indeed, we show here (cf Lemma C.3 in the Appendix) that the map Fb given by 

\\y-xpxM 

is increasing, continuous and rs((0, +oo)) = (0, +oo). Thus, there exists a unique value > 
such that Tb(Xb) = C. 
Similarly as for Strategy A, a fixed point scheme will be discussed in Section 5. 



2.2 Main results 

2.2. 1 Preliminary remarks 

The main idea behind the analysis of LASSO-type methods is the following. First, the ii penalty 
promotes sparsity of the estimator /3. Since /3 is sparse, we may restrict the study to the subvector 
of resp. the submatrix Xf of X, whose components, resp. colunms, are indexed by T. 

Taking this idea a little further, since T is supposed to estimate the true support T of cardinality 
s, the first kind of result one may ask for is a proof that Xt is far from singular for every possible 
T. Unfortunately, proving such a strong property with the right order in the upper bound on s, 
based on incoherence only, seems to be impossible. The idea proposed by Candes and Plan in 
[8] to overcome this problem is to assume that T is random and then prove that non-singularity 
occurs with high probability, i.e. for most supports. 

Based on this model, the method first consists of proving that Xt satisfies, for < r < 1, 

1 - r < a^^iXr) < (t^UXt) < 1 + r, (2.11) 

with high probability. The proof of this property in [8] is based on the Non-Commutative Kahane- 
Kintchine inequalities. In the present paper, we instead use a result of [11] based on a recent 
version of the Non-Commutative Chernoff inequality proposed by Tropp [35], in order to obtain 
better estimates for the involved constants. The most intuitive conditions to prove (2.11) are: 

(i) T is a random support with uniform distribution on index sets with cardinal s; 

(ii) s is sufficiently small; 

(iii) X is sufficiently incoherent. 
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The main part of the analysis consists of proving that the least-squares oracle estimator, which 
knows the support ahead of time, satisfies the optimality conditions of the LASSO estimator with 
high probability. This will prove that the LASSO automatically detects the right support and sign 
pattern. The proofs of these results highly depend on the quasi-isometry condition (2.11) and 
similar properties obtained with the same techniques as for (2.11). We also need the sign pattern 
of P to be uniformly distributed and jointly independent of the support of T. This assumption 
was already invoked in [8]. 

2.2.2 Assumptions and main results 

The first so-called Coherence condition deals with the minimum angle between the columns of 
X. 

Assumptions 2.1: (Range and Coherence condition for X) The matrix X has unit £2-norm 
columns, is full rank and its coherence verifies 



logp' 



for some numerical constant > 0. 



Assumptions 2.2: (Generic sparse model [8]) 

1) The support T of /5 is random and has uniform distribution among all index subsets of 
{1, . . . , n} with cardinal s, 

2) Given T, the sign pattern of is random with uniform distribution over {—1, +1}"*, and 
jointly independent of the support. 

The last condition concerns the magnitude of the nonzero regression coefficients /3j, j eT. Let 
a > 0, r e (0, |] and 

K = AVT+a. (2.12) 

Let us now define 

y/n + \/2a\ogp 1 — r 
^/s^ y/l + r' 

Let us introduce the function 

Ux) = xe"^"/^, x>0. 



^n,s,,p ^ 4 ^ (2.13) 



Since £q,((0,+oo)) — (0, +oo), the following constant Co := Co{a,r) is well defined: 

UCo) = 10e-3^^«;2 > 0. (2.14) 

It will appear in the number n of observations (explaining the index 'o'). 

We can now define the range assumption for the coefficients of /3 for Strategy (A). 

Assumptions 2.3: (Range condition for ^: Strategy (A)) The unknown vector ^ verifies 

min 1/3,1 > Hlf'^a. (2.15) 

Our main results show that the estimators (3 defined by either Strategy (A) or Strategy (B) 
recovers the support and sign pattern of /3 exactly with probability of the order 1 — {p~") using 
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similar bounds on the coherence and the sparsity as in [8]. 
As from now, let us choose r e (0, |] and set: 

= T^- (2.17) 

Theorem 2.5: Let a > and p > e^/". Let Assumption 2.1 hold with given above. Let 
Assumptions 2.2 and 2.3 hold with 

s < «o:=r^|^ (2.18) 
logp 11-^11 

n > s(Cologp + l). (2.19) 
Then the probability that the estimator ^ defined by Strategy (A) with 



L20(l+r)Cspar P" " ' 2(l+r)apar 

exactly recovers the support and sign pattern of /3 is greater than 1 — 228/p". 



(2.20) 



Remark 2.6: The choiCe of the constant 20 is unessential and the reader can check for himself 
which range is relevant for his own specific application. 
We now turn to Strategy (B). Let us define for C > 0, 



.n,s,p f J1 + 2C y/n- s + y/2a logp V2tt logp' 



n — s 1 / y/n(n — s)\" 



M^'rc = r L M ■ (2.22) 



Let us state the corresponding range assumption for the coefficients of 

Assumptions 2.4: (Range condition for ^: Strategy (B)) The unknown vector ^ verifies 

minl^il > L^a, (2.23) 



1 < M^';;^a. (2.24) 



Theorem 2.7: Let a > 0, p > e^/" and set Co = Choose C > 0. Let Assumptions 2.1, 2.2 

and 2.4 hold with 



' ^ r^^' (2.25) 
n > Co(l + 2C) slogp + s. (2.26) 



Then the probability that the estimator /3 defined by Strategy (B) exactly recovers the support 
and sign pattern of ^ is greater than 1 — 229/p°^. 
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2.3 Important comments 

2.3.1 About X 

The normalized Gaussian example is instructive. First, when X is obtained from a random matrix 
with i.i.d. standard Gaussian random entries by normalizing the columns, the coherence is of 
the order ^y\ogp/n (See below for a short proof). Therefore, taking n of the order of log^p is 
sufficient for satisfying the Incoherence Assumption 2.1. Second, it is also well known that 
is of the order p/n, see e.g. [30]. This suggests in particular that the upper bound (2.25) on the 
number s of nonzero components of ^ may be understood in the Gaussian setting as 

P C'spar I ^ 



< ^^pai _ Q 



logp WXW^ \logp 

This order of magnitude might be also valid for much more general random designs. 

Notice that the estimate ^J\og p/n of the coherence for i.i.d. Gaussian matrices with normalized 
columns easily follows from the Paul Levy concentration of measure phenomenon on the sphere 
and the union bound: Since, due to normalization, each column is Haar distributed on the unit 
sphere, rotational invariance implies that the scalar product of two column vectors Xj and Xji 
satisfies 

P {\{Xj, Xj,) \>u) = P (I (Xj, Xj,) \>u\ Xj,) < 2 exp {-cn v^) , 

for some constant c, by the well known concentration of measure phenomenon on the unit 
sphere. Thus, the union bound gives 



P 



(^^v^^^^\{Xj,Xj,)\>^^ < P^P-A.2e^p{-cnu'), 

< exp {—cn + 2 log p) . 



This last quantity is less that for u > c' sjlogp/n with c' = ■\J{a + 2)/c. 

An interesting question concerns the pertinence of the coherence for the problem of variable 
selection using the LASSO. The work of [37] shows through numerical investigations that certain 
conditions on the matrix X (requiring in particular the knowledge of the true signal's support, 
without any statistical assumptions on beta though), allow to deduce sharp bounds on the 
minimum sample size needed for exact support recovery. When the true support is not known 
ahead of time, conditions such as the ones in [37] are required to hold uniformly or at least for 
most support with high probability. Proving such a property for matrices more general than i.i.d. 
Gaussian matrices implies loosing sharp bounds on the minimum sample size. The advantage 
of the coherence over such assumptions relies in the fact that it can be computed very easily 
for any given matrix. The main drawback is that the resulting bounds on the minimum sample 
size might not be sharp. 

2.3.2 Order of H^^^^^ P 

In the case where X is i.i.d. Gaussian, the order of sq is n/ \ogp and thus the order of H^'^o *' is 
Vlogp, just as in [8]. Indeed, 



' ' n 



logp 
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2.3.3 About C and 

Increasing the upper bound on the magnitude of the ^/s via decreasing the constant C also 
results in increasing the lower bound. Therefore, C governs a sliding window inside which the 
coefficients of /3 can be recovered by the LASSO. Moreover for a given n, one can decrease the 
lower bound L^ *'^ in Eq. (2.21) by increasing C. This would result on a smaller sparsity in Eq. 
(2.26). Taking C as C ~ n/(slogp) implies the usual order ^/logp for the minimum of beta's 
(See Eq. (2.21)). If one wants to specify C in a way that is independent of s one may run the 
risk of prescribing an incorrect order for L^'^'^ as a function of n. This technical issue should 
however be considered as of theoretical interest only and not so much of a problem in practice. 
As an analogy, consider the plain LASSO with known variance: there exists a universal way of 
choosing the parameter A, but many practitioners use the LARS instead in order to explore all 
the supports occuring on the A-trajectory and compare them using a standard model selection 
procedure (AIC, BIC, Foster and George, etc). In the same manner, one could also vary the value 
of C and compare all supports on this trajectory. In this spirit, our simulation experiments show 
the histogram of recovered and incorrectly detected components over a large range of values of 
C. One nice surprise is that Strategy (B) is quite robust vs. the actual choice of C at such a low 
signal to noise ratio level. 

2.3.4 About the constants Csp^r and 

Let us compare the numerical values of these constants to the one obtained in [8]. 

One of the various constraints on the rate a in [8] is given by the theorem of Tropp in [34]. 
In this setting, 

a = 21og2 
r = 1/2, 

the author's choice of 1/2 being unessential. To obtain such a rate a, they need to impose the 
r.h.s. of (3.15) in [8] to be less than 1/4, that is: 

30C^ + 13y2CV < \- (2.27) 

This yields Cgpar < 119 x 10~^. Let us choose Cgpar close to this maximum allowed. Then, compute 
by (2.27). This yields 

Cspa. ^ 1.18 10-^ ~ 1.7 10-3. 

(The additional condition coming from the end of the proof of [8, Lemma 3.5], that is = 

21og(2), is not limiting since ^3/(128 log 2) > 1.7 lO'^.) 

Our theorem allows to choose any rate a > 0. To make a fair comparison, let us also choose 
a = 1.5 > 2 log 2 and r = 1/2. We obtain: 

Qpa. ~ 1.4 10-^ = 0.2. 

3 Proof OF Theorem 2.5 

The proof is divided into several steps. The main two steps are as follows. First, we provide the 
description and consequences of the optimality conditions for the standard LASSO estimator as 
a function of A. Second, we prove that these optimality conditions are satisfied by a simple and 
natural oracle estimator. 
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3.1 Enforcing the invertibility assumption 

We recall the basic result we proved in [11] regarding the invertibility of random submatrices 
via the Non-commutative Chemoff Inequality. 

Theorem 3.1: Let r e (0, 1), a > 1. Let X be a full-rank n x p matrix and s be positive integer, 
such that 

r 

2(1 + a) logp 

2 

r p 
4(1 + a)e2 ||X||21og;p' 

Let T c {1,. . .,p} be a set with cardinality s, chosen randomly from the uniform distribution. 
Then the following bound holds: 

21 6 

P(||X*,XT-Id,|| >r) < — . (3.28) 

By Theorem 3.1, we have 

(1 + r)-^ < \\ {X'tXt)'' \\ < (1-r)-^ (3.29) 
(1-r)'/' < ll^rll < (l + r)^/2 (3.30) 

with probability greater than 1 — 216 p""'. Thus, throughout this section, we will assume that 
(3.29) and (3.30) hold, i.e. we will reduce all events considered to their intersection with the 
event that (3.29) and (3.30) are satisfied. 

3.2 The oracle estimator for p and A 

We now discuss the next step of the proof of Theorem 2.5, which consists of studying some sort 
of oracle estimators for /3 which enjoys the property of knowing the support T of /3 ahead of 
time. _ ^ 

For a given A, one might like to consider the following oracle for ^: 

P e e^igmax -hy-Xb\\l-X\\b\U, (3.31) 
beB ^ 

where 

B = {be W, supp(6) = T, sgn(6) = sgn(/3T)}. 

However, it is not so easy to derive a closed form expression for /3. Therefore, it might be more 
interesting to consider instead the following oracle: 

p e argmax - hy - Xb\\l -X sgaiPrfb. (3.32) 

feeKP,supp(b)=T 2 

Indeed, /3 satisfies 

X'j.(y-XTPT)-Xsgn(l3T) = 0, 
and we obtain that /3 is given by 

Pt = (X*,XT)"'(x^l/-Asgn(^T)). (3.33) 



s < 
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This formula is the same as in the proof of Th. 1.3 in [8], but here, A is a variable. 

Now let us recall that in the known variance case, Candes and Plan assume that 

A' = Cvar<j'logp, (3.34) 
for some positive constant Cyar- It is then relevant to seek our oracle A as: 

A = Cvar log p. (3.35) 

n 

Replacing /3 by its value (3.33), we obtain 

Cvar h - {X'tXtY^ fey - A sgn(/5T)) \\l = Al 

Thus, 

C,^,\\Py^y + \XT[XtrXT) sgn(/5T)||^ = A^ 
and using the orthogonality relations, we obtain 

1 -ri . — - 

which is equivalent to 

P = ^v^^l^ (33^) 

cd^- II^T(^^^T)-^sgn(/3^)||i 

We henceforth work with this definition of A. Notice that A is well defined whenever 

77 

Cvar < ■ (3.37) 

- \\XT{Xi,XT)-'sg-^{PT)\\l logp 

The choice of Cvar will be done in the next section. 
3.3 Study of the oracle A 

In this section, we provide a confidence interval for A. In particular, the first subsection shows 
that A is well defined. 



3.3. 1 Bounds on \\Xt (X^tXtT^ sgn(^T) ||i 

Using the lower bound on crinin(-'^T) and the upper bound on cTmaxl-'^T) given by (3.29) and (3.30), 
we have, with high probability: 

s< WXTiX'TXTr'sgniPrm < ^- (3-38) 



We write the choice of Cvar made in (2.20) as 

'^'-'^^ " < Cvar < l ^'r'^^ : , (3.39) 



20 1 + r So logp 2 1 + r sq logp 

where sq is the maximum sparsity allowed in Inequality (2.18), namely, 

P Cgpar 

logp ll-^lp 

In particular, the condition (3.37) is satisfied which garantees that A is indeed well defined. 
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3.3.2 Bounds on llPvx-zlU 

II Vj, II 

Using some well known properties of the distribution recalled in Lemma B.l in the Appendix, 
we obtain that 

F (^\\Py±{z)\\2/a > Vri^+V2tj < exp(-t) (3.40) 

and 

^iWPv-ml/'y' <u{n-s)) < ^ A n . (3.41) 

Tune u such that the r.h.s. of (3.41) equals i.e. 

/ I \ 4/(n-s) 

u 



pa 



Thus, we obtain that 

llPvX^ 



and 



1^ < {V^^s + ^21og(^)j < (v/^T^ + v/2^d^)' (3.42) 



with probability greater than or equal to 1 — 2p~". 

3.3.3 Bounds on A 
Lemma 3.2: The following bounds hold: 

A < , 2^ Vn5Z±^2alogp ^3_^^^ 
VI + r 

A > K cT yiogp. (3.45) 
Proof: Recall that < s < sq. From (3.39), we have 



2 1 + r So log P 
We then obtain, by virtue of (3.36) and the upper bound in (3.38), 



A^ < 



|2 



2^0 - \\Xt {X'tXt)-' sgniPr) 111 



llPy^-^lli 



Using the bound (3.42), we deduce (3.44). 
On the other hand, the bound (3.43) and 
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yield 

A" > 



From (2.19), we know that n verifies 



n-s n-so 

> > C^ologp. (3.46) 



So Sq 

Thus, iiotiiig that (7r(n - > 1, 

10e(l + r) 

Writing = e-^"i°gf/("-^) and, using (3.46) again, 

logp ^ logp ^ 1 ^ 1 



n — s ~ n — So ~ So Co ~ Co^ 
we obtain 

^-4a/(n-s) g-4a/Co _ 

Therefore, 



> -^i-^e-^«/^=CoaMogp. (3.47) 
lOe 1 + r 



Let us recall that the constant Co has been precisely chosen to satisfy 

4(Co) = Coe-^"/^° = lOe 7:^^^«:'. 

(1 — rj^ 

As a conclusion, we have just proved (3.45). □ 



3.4 Candes and Plan's conditions 

To obtain the exact recovery of the support and sign patterns of /3, we will need similar bounds 
as the ones in [8, Section 3.5]. Namely, 

(i) ||(X^Xt)-ix^z||oo < « 

(ii) ||(X*.XT)-^sgn(^T)||oo<3 

(iii) ||X*.eXT(X^X^)-isgn(^T)||oo < i 

(iv) ||X*.c (Id - Xt(X*.Xt)-^X*.) ^IIoo < « ay/]^ 

(v) \\X^XT-lds\\ < r. 

When r = ^, these conditions were proven to hold with high probability in [8] based on 
previous results due to Tropp [34]. Most of the proofs that these conditions hold with high 
probability are the same as in [8] up to some slight improvements of the constants. 



Proposition 3.3: The bounds (i-iv) hold with probability at least 1 — 10/p°'. Condition (v) holds 
with probability at least 1 — 216/p". 

Proof: See Section A in the Appendix. □ 
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3.5 Last step of the proof 

We now conclude the proof using the strategy announced in the beginning of this section: 

(i) We prove that the proxies ^ and A satisfy the optimality conditions (C.89) and (C.90), from 

which we deduce^that (5 = (5 and A = A. ^ 

(ii) Since the proxy /3 has the right support and sign patterns, we conclude that j3 exactly 
recovers these features as well. 

3.5.1 p and p have the same support and sign pattern 

First, it is clear that f3 and f3 have the same support. Next, we must prove that /3 has the same 
sign pattern as /3. Use Proposition 3.3 to obtain 

0t-I3t\\oo < \\{X'j,XT)-'X'j,z\\^ + X\\{X'j,XT)-'sgn{PT)\\oo 
< K ay^logp + 3A. 

Using the lower bound (3.45), and the expression of k, we obtain 

Wt-Moo < 4A. (3.48) 

A sufficient condition to guarantee that the sign pattern is recovered is that this last upper bound 
be lower than the minimum absolute value of non-zero components of f3, i.e. 

4A < mill 1/3 J. (3.49) 
Using the upper bound on A in (3.44), this is achieved in particular when 



4(7 ^ , < mm Lo,- , 



which is implied by Assumption 2.4. 

3.5.2 p and A satisfy the optimality conditions 

Using the lower bound (3.45) on A, the proof of the fact /3 and A satisfy the optimality conditions 
is exactly the same as in [8, Section 3.5]. We repeat the argument for the sake of completeness. 
On one hand, by construction, we clearly have 

XUy-XP) = -XsgniPr). 
Since ^ and ^ have the same sign pattern, we actually have: 

X'j.{y-XP) = -Asgn(3T). 

On the other hand, 

\\XUy-Xmoc = ||X^.Pyx(^) + A X^.XT(X^XT)-'sgn(/3T)||oo 

< \\X'tcPv^{z)\\oo + X ||X*.eXT(X*,XT)-^Sgn(/3T)||oo 

< K (7-\/l0gp+ -A 

< 7 A < A. 
4 



(3.50) 



Hence, the two parts of the subgradient conditions (C.89-C.90) are satisfied by ^ and A, which 
means that 

P = h- (3-51) 



17 



In other words, ^ corresponds to the solution of problem (1.2) with the penalization A = A. 
Moreover, A has been determined so that it verifies (3.35) 

A = Cvar logp, 

n 

i.e., plugging (3.51), 

A = Cvar log p. 



n 



Therefore, A is a solution of Eq. (2.7). By virtue of uniqueness proved in Appendix C.3, we 
deduce that 



A = A. 

3. 5. 3 Conclusion of the proof 

The two preceding sub-sections prove that l3 has same support and sign pattern as /3. This 
occurs when (3.29) and (3.30) (both implied by the invertibility condition (v) in Sec. 3.4), Candes 
and Plan's conditions (i-iv) in Sec. 3.4 and the bound on ||Pyx2;||2 in Sec. 3.3.2 are satisfied 
simultaneously. Therefore, this occurs with probability at least 

216 + 10 + 2 



1 

as announced. 



pa 



4 Proof of Theorem 2.7 

As in the proof of Theorem 2.5, the quasi-isometry property (3.29) and (3.30), and Candes and 
Plan's conditions of Section 3.4 will be assumed. Notice also that the results of Section C.l are 
still valid with the assumption of Theorem 2.7. 

4.1 The oracle estimator 

As in the case of Section 3.2, the oracle for ^ is given by 

= (X^XT)"'(x^y-Asgn(/3T)). (4.52) 

We now seek A verifying 

^Wv-XtMI = CXsgn{/3T)%. (4.53) 
Replacing P by its value (3.33), we obtain 

^ ||l/-XT(X^Xr)"'(x^i/-Asgn(/3T))||^ 

= C A sgn(/3T)* ( W^t)"' {X'^y - A sgn(/3T))) . 



Thus, 



-CX\sgn{l3T), (X^Xr)"'sgn(/3T)) +CA sgn(/3T)* (X^Xt)"' X^y. 
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Using the orthogonality relations, we then obtain 

-C A^(sgn(^T), (X*.Xr)"'sgn(^T)), 

which is equivalent to 

^ + Al sgn(^T)\\l - CA sgn(^^)* (X^X^)-' X'^y + ^||Pv^.^||^ = 0. (4.54) 

The roots of the quadratic equation are 

3^ ^ C sgniPrnX^TXT)-' X'^y ± VA ^^^^^ 
(l + 2C)||(X*,XT)-^sgn(^^)||i ' 

where 

2 



A = (Csgn(/3T)*(X^XT) 'X^y' 

-(1 + 2C)|| {X'^XT)-Kgn{^T)\\l\\Pv-z\\l. 



4.2 Study of the oracle A 

Following the same strategy as for Strategy (A), we now provide a confidence interval for A. 

4.2. 1 Premilinaries 
We have 

,-1 ^rt , n \-l 



sgn(/3T)*(X^XT)" = sgn(/3T)*(X^XT)"'X^(Xr/3 + z) 

= sgn(/3r)*/3 + sgn(/3T)*(X^XT)"'x* 
= + (Xt (X*.XT)"'sgn(/3T),Py,^ + Pv-^). 

Hence, 

sgn(/3T)* (X^Xt)"' X\,y = + (Xt {X'tXt)'^ sgn(/3T), Py,-^). (4.56) 
Note that the Cauchy-Schwarz inequality yields 

{Xt (X^Xr)"'sgn(/3r),Py,;2)| < || (X^Xt)"^ sgn(/3r)||2||Pv,;2||2. (4.57) 
4.2.2 Bound on \^vtAi 

Using some well known properties of the distribution recalled in Lemma B.l in the Appendix, 
we obtain 

P(^||Pvv(-z)||2/(J> Vs + v^) < exp(-i). (4.58) 
Tune t such that e~* = 2p~", i.e. 

t = log(p72). 

Hence, 

p(||Pvt('2)I|2/^t> V21og(pV2)) < P"". (4.59) 
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4.2.3 Positivity of A 

We begin with the study of sgn{/3TY {X^^Xt)'^ X^y and || sgn(/3T)||i||Pv'X^|||, two 

key quantities in the analysis. 



We first study sgn(^r)* (X^^t)"^ ^t?/- % (3.29), we have 

||(X*.X^)-^sgn(/3T)||2 < 



1 -r 



(4.60) 



Thus, using (4.59), (4.57) and the lower bound (2.23) from Assumption 2.4 on the non-zero 
components of we can write 



{Xt{X'tXt) 'sgn{/3T),PvrZ) 



Therefore, from (4.56) we deduce that 



< a 



2 a log 



Second, we study || (X^Xt) ^ sgn(/3T)||i||PvcJ-;2||i. We have 

||(X^Xr)"^sgn(/3T)||2||Py^^^||2 < ^^^37 (x^:^ + V2alogp) 



Thus 



> — s2 mill |/3,f - (7'(1 + 2C)-^ ( ^J^^s + A/2Q;logpV 

4 i<i<p 1 — r V / 

and Assumption 2.4 shows that A > 0, which ensures that A is well defined. 



(4.61) 



(4.62) 
(4.63) 



4.2.4 Bounds on A 
First, let us write 



y/A^(c sgniPrY (X^X^)"' X'^y 



X 



1 - 



(1 + 2C) II (X*.XT)-^sgn(/3T)||i||Pyxz| 



On one hand, due to y/1 - S < 1 - | on (0, 1), we obtain 

VA < (Csgn(/3r)*(X^XT)"'x^y) 

(1 + 2C)|| (XfX^)-^ sgn(/3T)||i||Pv.x^||i 
2C sgni/SrY (X'^Xt)-' X'^y 
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Combining this last equation with (4.55), we obtain that 



A > ^ . . (4.64) 

- 2Csgn(/3T)*(X^XT)"'X^y 



On the other hand, we also have \/l — 5 > 1 — 5 on (0, 1). Thus we can write 

(1 + 2C)\\ (X^Xt)-^ sgn(/3^)||^||P^x;.||i 

and combining this last equation with (4.55) and the previous upper bound, we thus obtain 

A < 



Using (4.61), we jfinally get 



3C||/3||i - - cm 

Combining this last equation with (3.43), we obtain 



< A < 2 ' • (4.65) 



,2 



4 



3cpK - - cii^ll 

Using Assumption 2.4 and (2.22), we thus obtain 



< A < 2 a' (4.66) 



A > «; (T^logp. (4.67) 

4.3 Last step of the proof 

4.3. 1 p and fi have the same support and sign pattern 

As in the case of Strategy (A) it is clear that ^ and /3 have the same support. Let us now verify 
that they have the same sign pattern. 
As in Section 3.5.1 and based on (4.67), we obtain 

IIAt-^StIIoo < 4A, 

exactly as for Strategy (A). Using the upper bound on A in the right hand side of (4.66), we thus 
need 

C j€T cr^ 

to garantee that /3t and /3t have the same sign pattern. In view of this inequality, and since 
11/3 111 > smiiijgT an even stronger sufficient condition is 

,2 



{y/n- s + y/2a\ogp) mirij-gr 

o 7Z ^ 



2 



CS - (72 ■ 

Noting that ^ < 2^^=, we conclude that this condition is also implied by Assumption 2.4. 



21 



4.3.2 j3 and X satisfy the optimality conditions 

The proof is exactly the same as in Section 3.5.2 after replacing (3.45) by (4.67). 

4.3.3 Conclusion of the proof 

The two preceding sub-sections prove that /3 has same support and sign pattern as /3. This occurs 
under the same conditions as those mentioned in the conclusion of the proof of Theorem 2.5, 
Sec. 3.5.3, plus the bound on ||Pyj,2;||2 in Sec. 4.2.2. Hence, this occurs with probability at least 

^ 216 + 10 + 2 + 1 

pa 

as announced. 



4.4 Epilogue: Nonempty range for 

We need to ensure that the range of admissible values for /3 is sufficiently large. The intuition 
says that this can be achieved by allowing sufficiently large values of n. In other words, we 
would like to know the additional constraints on the various parameters ensuring 

I n,s,p hj\n,s,p 

It then suffices to know when the following inequalities are satisfied: 



where 

ma,r,C = 6/t ■ 

^yl — r 

First, notice that under the condition 

n — s > 8a s logp > Salogp, (4.69) 

4 

we have log ^ V^^^-^) ^ ^ _i_ (iog(7r) + log(n - s)) - a logp^ > -i, and then 



4 

n — s 



Therefore, since we also have \/2alogp < ^n — s, (4.68) is fulfilled if 

Vlogp 

i.e. 

n — s > 4e ^ c P- 
This explains the constraint (2.26) with the constant Co := 4e m^^^ > So;. 
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5 Algorithms and simulations results 

In this section, we propose one iterative algorithm for Strategies (A) and (B) and we study their 
practical performance via Monte Carlo experiments. 

We performed Monte Carlo experiments in the following setting. We took p = 600, n = 75 
and s = 9 and we ran 500 experiments with cr^ = 1 and the coefficients of (5 were randomly 
drawn independently as B times a Bernoulli ±1 random variable plus an independent centered 
Gaussian perturbation with variance one. 

5.1 Preliminaries 

Our algorithms will be well defined under the assumption that for each positive value of the 
relaxation parameter, the value (5\ of the regression vector is unique and the trajectory of (5x is 
continuous and piecewise affine. This property is well known under various assumptions on 
the design matrix X. It is a basic prerequisite for the theory behind Least Angle Regression and 
Homotopy methods. We refer the reader to [26] and [16] for information on these problems. See 
also [15] for a recent account on the study of I5\ as a function of A under generic conditions on 
the design matrix. 
The subgradient conditions for the LASSO imply that 

= Asgn(^^J. (5.70) 

where X^^ is non-singular, and we obtain the well known fact that, for any A > such that 

= (X|.^X^J-^(x|,^|/-Asgn(AfJ). (5.71) 

The following result is straightforward but useful. 

Lemma 5.1: (N on triviality of the estimator) Let S be the set 

E={(5,5); 5c{l,...,p}, 5 e {-1,1)1^1, \S\<n, a^^{Xs)>Q}. (5.72) 

The inequality 

mi \\{XlXsr\xy-\5)\l > (5.73) 

holds with probability one. 

Proof: This is an immediate consequence of the Gaussian distribution of z. □ 

5.2 The standard LASSO with known variance 

5.2. 1 Simulations results: high SNR 

With the choice S = 40, in all of the 500 experiments, we found that the support was exactly 
recovered. 

5.2.2 Simulations results: low SNR 

Figure 5.2.2 below shows the histogram of the number of properly recovered components (left 
column) and the number of false components (right column) for the LASSO estimator with 
known variance and A = 2a^/2\ogp. 
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Figure 1. Histogram of the number of properly recovered components (left column) and the 
number of false components (right column) for the LASSO estimator with known variance and 
A = 2a^/2\ogp for coeff. mean level 5 = 1,2, 5, 10 (from top to bottom) 



5.3 Strategy (A) 

5.3.1 The algorithm 

As was discussed in Section 2.1.1, finding the estimator A) in Strategy A is equivalent to 
solving the equation 

ryl(A) = Cvar- 

Since the function Ta is increasing (see Appendix C.2), there is a number of Newton-type 
methods which can be used to solve this equation very efficiently and globally, i.e. without any 
condition on the initial iterate A*^°); see e.g. [28]. Instead of such a refined method, we propose 
below a simpler fixed point iteration which was observed to work very well in practice. 

Notice that the first step of the fixed point iteration procedure is similar to the correction of 
the standard estimator of a proposed by [31]. 

5.3.2 Simulations results: high SNR 

As for the case of the standard LASSO with known variance of Section 5.2.1 we found that, for 
B = 40, the support was exactly recovered in all of the 500 experiments. 
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Algorithm 1 Fixed point iterations for the LASSO with unknown variance 

Input A(o), / = 1 and e > 
while |A('+i) - A«| > 5 do 

Compute /3x{i) as a solution of the LASSO problem 

^,(0 e argmin^||7/-X6||^ + A«||6||i (5.74) 

Set Xii+^) = ^\\y - X^%h 
end while 

Set AW = A(^), ^(^) = ^,(.) and a(^) = - X^l?) h- 
Output and A^-^). 



5. 3. 3 Simulations results: low SNR 

We performed Monte Carlo experiments in the same setting as for the LASSO in Section 5.2.2. 

In real situations where the level of magnitude of the regression coefficients may not be much 
higher than the noise level, one observes that false positives often occur for the LASSO estimator 
with known variance. As seen from these results, the LASSO estimator where the variance is 
estimated using the penalty A = 2a-y/2 logp performs at least as well as the standard LASSO 
estimator to which the true variance is available. The estimator a of the standard deviation is a 
slightly biased as shown in Figure 5.3.3. 



5.4 Strategy (B) 

5.4. 1 The algorithm 

Basic computations show that the derivative of F with respect to A is given by 

dT^ ^ -||%lli-A||(X^X^J-^/^sign(4) 111 

^A \\y-xA\\l 

on each Ik, where Ik, is a maximal open interval on which the support of /3a is constant, for 
/c G /C and UkeKh is a connected interval of [0, +oo). See for instance [12, Section 4]. 

In order to compute the LASSO estimators (/?, A) satisfying the penalty vs. fidelity tradeoff 
constraint, we need to find A such that Tb{X) = C. Since F is strictly decreasing by Lemma 
C.3, this task is not difficult to perform. A simple Newton-Raphson procedure for solving this 
equation is summarized in Algorithm 5.4.1 below. 



5.4.2 Simulations results: high SNR 

As for the case of the standard LASSO with known variance of Section 5.2.1 we found that, for 
B = 40, the support was exactly recovered in all of the 100 experiments. 

5.4.3 Simulations results: low SNR 

We performed Monte Carlo experiments in the same setting as for the LASSO in Section 5.2.2. 

Figure 5.4.3 below shows the histogram of the number of properly recovered components 
(left column) and the number of false components (right column) for the LASSO estimator with 
unknown variance and the penalty vs. fidelity tradeoff constraint for the values C = .01, .1, .5. 
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Figure 2. Histogram of the number of properly recovered components (left column) and the number 
of false components (right column) for the LASSO estimator with unknown variance using Strategy 
(A) and A = 2a^/2Togp for coeff. mean level 5 = 1,2, 5, 10 (from top to bottom). 



The instances where Newton's iterations did not converge were simply discarded although 
implementing a line search or a trust region strategy could easily have produced a correct result 
at the price of increasing the computational time for the Monte Carlo simulations study 

The number of well recovered components of /3 is always equal to the true value 9 as C 
increases for all values of C. On the other hand, the number of false positives increases with 
C. Our estimator with penalty vs. fidelity tradeoff constraint is seen to have quite better perfor- 
mances than the standard LASSO and LASSO with estimated variance of the previous section 
with respect to the number of false positives; compare Figure 5.4.3 with the second row of Figure 
5.2.2 or Figure 5.3.3. This was the main objective for proposing this strategy and the presented 
simulations show encouraging evidence of its robust behavior in the low SNR case. The low 
dependency on C is a property which might be well appreciated in practice when neither the 
signal nor the noise levels are precisely known ahead of time. 

5.5 Comments 

The simulations results confirmed the theoretical findings that, in the high SNR case. Strategy (A) 
and Strategy (B) perform as well, without knowing the variance ahead of time, as the standard 
LASSO which uses the true value of the variance. Although the results are presented for a 
particular set of parameters, this behavior was observed more generally for a large number of 
numerical experiments with different parameter configurations, for which the standard LASSO 
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Figure 3. Histogram of a for the LASSO estimator with unknown variance using Strategy (A) and 
A = 2CTy21ogp for coeff. mean level B = 1,2, 5, 10 (from top to bottom). 



Algorithm 2 Newton's method for the LASSO with penalty vs. fidelity tradeoff constraint 

Input A(°), / = 1 and £ > 
while |A('+i) - A«| >e do 

Compute /3x{i) as a solution of the LASSO problem 

^,(0 e argmin i||y-X6||2 + A«||6||i 



(5.75) 



beRP 



Set A('+i) = A« 



^(A(O)-i r(A«) 
dA 



end while 

Set A(^) = A(^), = ^^^L) and let a^^)' be given by 



a 



n 



(5.76) 



Output and A^^). 



exactly recovers the true support and sign pattern. In the low SNR setting, the standard LASSO 
and Strategy (A) perform poorly in the sense that many false components are selected. The Monte 
Carlo experiments show that Strategy (B) is more robust in the low SNR setting, in the sense 
that the estimated support contains much less false components. Surprisingly, this phenomenon 
was observed over a wide range of values for the constant C. In other words, the dependence 
of Strategy (B)'s performance on C appeared as rather unessential for the recovery problem in 
the low SNR setting. As a preliminary practical conclusion. Strategy (A) appeared to be more 
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Figure 4. Histogram of the number of properly recovered components (left column) and the number 
of false components (right column) for the LASSO estimator with unknown variance using Strategy 
(B) for C = 0.01, .01, 0.5, 1, 5, 10 (from top to bottom) with level B = 5. 



suitable for the high SNR setting and Strategy (B) more suitable for the low SNR setting. In 
practice, the choice of C in Strategy B could be based on standard model selection procedures 
(AIC, BIC, Foster and George, etc) for comparing the obtained supports over a large range of 
possible values. The limited number of possible supports occurring in practice as C varies makes 
this comparison numerically tractable. 

Another interesting question is the one of the convergence of the algorithms proposed for 
Strategies (A) and (B). From the practical viewpoint, let us report that convergence was observed 
except in rare cases during the Monte Carlo experiments. However, we decided not to pursue 
their theoretical analysis here, since more robust methods enjoying global convergence have been 
proposed in the literature during the last twenty years. We refer the interested reader to e.g. [28] 
for a globally convergent damped Newton method. Such methods are however more delicate to 
implement and the algorithms proposed in the present paper seem to be a good choice to start 
with in most practical experiments. 

Finally, there remains the question of choosing between Strategy A and Strategy B on a given 
practical problem. One reasonable way to proceed might simply be as follows: compare the 
supports obtained via both methods, using a standard model selection procedure such as BIC, 
AIC, Foster and George's criterion, etc. 
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Appendix A 

Proof of Proposition 3.3 

First, let us recall a technical result we obtained in [11]: 
Lemma A.l: The following bound holds: 

12 X 



F{\\RH\\i^2>v) < p(e^^) , (A.77) 



provided that e--'^-^ < 1. 
Let us introduce the events: 



E = {||X^Xt -Id|| < r} 
B = <! WRHh^o < 



The proofs that Conditions (i) and (ii) hold with high probability are trivial modifications of 
the ones given in [8] up to the constants. The proofs that Conditions (Hi) and (iv) hold with 
high probability can be performed using the following by-product inequality from our Lemma 
A.l: 

P(S^) < pexp(^\og(e^) logpV (A.78) 



instead of using [8, Lemma 3.5] and [8, Lemma 3.6]. Here, we take 

> max(e2Cspa. ;(1 + q;)C^), (A.79) 

so that 

F{B^) < ^. (A.80) 

All the proofs are moreover based on the simple inequality 

P(^) = ¥{AnEnB)+F{An{E''UB')) 

< E [P I i?) lEns] + HE') + nB''), 
and the bound, for a given vector W: 

F{\{W,X)\>t) < 2e-*'/(2||w'||i)_ (A.81) 

This last bound holds true for sub-Gaussian random vectors with independent components 
having Bernoulli or standard Gaussian distribution, for instance. 

A.1 Condition (i) 

Here, let Wi be the ith row of (X*.Xt)-^X*.. Since {Wi,z) - M{Q, \\Wi\\l), we have from (A.81) 
and the union bound: 



P (umx\{Wi,z) \ >?j < 2s e-*V{2ma.,||W^,||i)_ 



Note that on E: 

max||Wi||2 < IKX^Xt)"^ IIX^II < + (A.82) 
ieT 1 — r 
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One then obtains 

2 



whenever 



. > + + (A.83) 

1 — r 

A.2 Condition (U) 

Let us show that the estimate (ii) holds with high probability. This is an actual consequence of 
our Lemma A.l. 

First, as in [8] p.2171 and Lemma 3.3 p.2166, we write the inequality 

||(X^Xr)-V(/3T)||oo < l + max|(H^„sgn(/3T))|, 
where Wj is the ith row or column of (XI^Xt)'^ — Id. Set 



A = <!max|(W,,sgn(/3r))| >2[>. 

is J 



Hoeffding's inequality yields: 



¥(A\R) < 2|J|exp — . (A.84) 



As in [8] p.2171 and p.2172, we write WWih < Thus on E: 

\\Wi\\2 < ^ < . 

1 — r 1 — r 

Recall that P(£'') < ^ since c satisfies (A.79). Moreover 

E[P(^| i?) W] < ^ 



holds true if 



< 2(1-'-) 



1 + a 

We can easily check that this last condition is compatible with (A.79) and 

r 

1 + a 

^^P"^ ~ (1 + Q;)e2' 
whenever r e (0, 1/2). Therefore, when r e (0, 1/2), the event 

||(X^XT)-^sgn(/3T)||oo<l + 2 = 3 
holds with probability at least 1 — 
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A.3 Condition {Hi) 

Here, Wi = (X^^XTy^X^^^Xi. Notice that onEnB: 

max\\W42 <jz ^^=- (A.85) 

leTc (1 - r)^/\ogp 

Using (A.80) again and the same previous arguments, we obtain 

¥ (^\\X'^.XT{X'^XT)-'sgn{PT)\\oo < \^ > 

A.4 Condition (iv) 

If one now sets Wi as the ith row of Id — XT{Xi^XT)~^Xi^ and note that on E for any i eT: 

\\Wi\\2 < \\Xi\\2 = 1, (A.86) 

then: 

p(||X^c(/-Xr(X^Xr)-'X^)^||oo <(T «: 0ogp) ^1"^' 

whenever 

K > ^y2{l + a). (A.87) 

A.5 Choosing k 

The parameter k has to satisfy (A.83) and (A.87). Since r e (0, i], one has ^ 2\/3 3.4. 

Thus we simply chose 

which is Eq. (2.12). 

Appendix B 

Some properties of the distribution 

We recall the following useful bounds for the x^i'^) distribution of degree of freedom v 
Lemma B.l: The following bounds hold: 



2 



Proof: For the first statement, see e.g. [24]. For the second statement, recall that 

Jo 



'0 

-dt. 



Since maxtgK+ t"e * — (ct/e)" and is attained att — a, we obtain that 



■a 
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Take for instance a = | and obtain 



(z//4e)4 ^ V 

On the other hand, we have 



r(^) > 



2 



and then. 



Hence, 



(z./4e)f ^ 72(e/2)f(|)- 



as desired. □ 

Appendix C 

Properties of the standard LASSO 

C.1 Reminders on the LASSO subgradient conditions 

In [18] Section III, it is proven that a necessary and sufficient optimality condition in (1.2) is the 
two following conditions: 

X\.{y-XK) = Asgn(^r) (C.89) 
||X*,e(y-X^A)||oo < A. (C.90) 

Moreover, if — X^a)||oo < A, then problem (1.2) admits a unique solution. 

Let us also recall (see [15] and [12]) that the support T\ c {1, . . . ,p} of ^\ verifies 

\fx\ < n. (C.91) 

C.2 General properties of A i-^ 3a 

Recall that Px is the standard LASSO estimator of ^ parametrized by A, 
The following notations will be useful. Define jC as the cost function: 

(0, +oo) xRP — > R+ 

^ ' ^ i\b) ^ l\\y-xb\\l + x\\b\\u ^^-^^^ 



and for all A > 0, 



9(X) = inf>C(A,6). 



Lemma C.l: Let the Generic Condition hold. Then, the function 9 is concave and non-decreasing. 

Proof: Since 6 is the infimum of a set of affine functions of the variable A, it is concave. 
Moreover, we have 
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where, by Proposition 2.2, ^ is the unique solution of (2). Using the jfilling property [20, Chapter 
XII], we obtain that 89 (X) is the singleton Thus, 9 is differentiable and its derivative at 

A is given by 

Moreover, this last expression shows that 9 is nondecreasing. □ 
C.2. 1 Proof of Lemma 2. 3 

(i) ||/3a||i is non-increasing - The fact that A i — > \\/3\\\i is non-increasing is an immediate conse- 
quence of the concavity of 9. 

(ii) Boundedness - Notice that using (5.71), we obtain that 

II^aIIi < max ||(X*Xs)-i(X*|/-A5)||^. 

Thus, A I — > is bounded on any interval of the form (0, M], with M e (0, +oo). Moreover, 
since its £i-norm is non-increasing, it is bounded on (0, oo). 

(iii) Continuity - Assume for contradiction that A i — > (5\ is not continuous at some A° > 0. Using 
boundedness, we can construct two sequences converging towards (5^o and (i'^o respectively 
with [5^o 7^ Pxo- Since £(A°, •) is continuous, both limits are optimal solutions of the problem 

argmin £(A°,6), (C.93) 

hence contradicting the uniqueness. 
C.2.2 Partitioning (0, +oo) into good intervals 

The continuity of A h-)- /3a implies that the interval (0, +oo) can be partitioned into subintervals 
of the type 4 = (Afe, Afc+i], with 

(i) Ao = and A^ e (0, -Foo] for A; > 0, 

(ii) the support and sign pattern of /3a are constant on each 1^. 

Notice further that due to Step l.a, T\ ^ ^ on at least Iq. Let K, be the nonempty set 

K = {A; e N, VA e 4, 3"a 7^ o} . 

On any interval 4, k e IC, uniqueness of /3 implies that the expression (5.71) for /3f^ holds. 
Multiplying (5.71) on the left by sgn (^Pf^ , we obtain 

II^aIIi = sgn {X%X^yx'^y - Asgn {X%X^y\g^ (%) . 

Thus 

d\ 

on (0, -Foo). Thus, the definition of E, we obtain that 

^(A) < - inf b\X%XfY''8 < (C.94) 
dX (5,<5)es 

on each Ik, k e K. and 

^II^aIIi 



(A) = -sgn(%)*(XlX^J-isgn(%), 



dX 



"(A) 
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on each 1^, k ^K,, i.e. on each 1^ such that = for all A in 1^, if any such Ik exists. Since 

A I — )■ is continuous on (0,oo), (C.94) implies that: 

(i) there exists r G (0, +oo), such that = (as an easy consequence of the Fundamental 
Theorem of Calculus and a contradiction). 

(ii) /3a = for all A > r. 

Hence VJkeKh is a connected bounded interval. 
C.2.3 1 2\ I = n for A sufficiently smali. 

Let (Afc)fceN be any positive sequence converging to 0. Let /3* be any cluster point of the sequence 
{l3\JkeN (this sequence is easily seen to be bounded under various standard assumptions; see 
e.g. [12, Lemma 3.5] for a proof). Fix e > and b G MP. For all /c G N, we have 

J^i^kM < C{Xk,b), (C.95) 

where jC is defined by (C.92). Since jC{Xk, •) is continuous, we can also write for k sufficiently 
large: 

Hence, C{Xk, l3*) < C{\k, h) + e. Letting A^ — )■ 0, we obtain 

h\y-xni < l\\y-xb\\l + e, 



and thus. 



l\\y-XI3*\\l < ml^\\y-Xb\\l (C.96) 



Since range(X) = M", (C.96) implies \\y - Xl3*\\l = 0, and then 



lim\\y-Xp,\\i = 0. (C.97) 

A4-U 

Notice further that {b G W, |supp(6)| < n} is a finite union of subspaces of MP, each with 
dimension n — 1. Thus, 

m:= inf -\\y - Xb\\l > 0, 

{66MP; |supp(6)|<n} 2 

with probability one. Therefore for A sufficiently small, (C.97) implies \\y — X^xWl < from 
which we deduce that \Tx\ —n since one has \Tx\ <n (cf Reminder C.l). 

C.2.4 The map A h- |||/ - X^xh is increasing on (0, r] 
Using (5.71), we obtain 

y-XPx = Pvx(y)-AX^j4^X^J-^sgn(%), 

which implies that 



\y-x(3x\\l 



- 2A(P^j_^(y),X^,jXl X^J-^sgn (%)) 
+X' sgn {XlXfJ-'sgn (^^J 
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and thus, by the definition of Py± (y), 



\y-x(3,\\l 



X' sgn {^^y {X'^X^ysgn (^^J . (C.98) 



From (C.98), since {Xi,^Xf^) ^ is definite, we obtain that A i-^ — X^;^||2 is increasing on each 
Ik, and thus on (0, r] by using that A i-^ ||y — ^^a||2 is continuous on (0, r]. 



C.3 Study of 

Lemma C.2: Fa is increasing on (0,t] and liniA-^+oo T^IA) = +oo. 

Proof: Due to Step 3, and the definition of r, the set of values A > such that ||y — X^;^||2 > 
is nonempty. Let Ainf denote its infimum value. Take X e Ik for some k such that A > Ainf. In 
particular, A 7^ 0. Then, 

r^(A) = , , (C.99) 



j_ 

A2 



and we deduce that is increasing on 7^. By continuity, we have that F^ is increasing on 
(Ainf,T]. Once A > r, \\y - Xp^Wl = ||y||| and F^(A) = sAVI|y||2- Thus, liniA-^+oo Fa(A) = +00 as 



desired. 



□ 



The fact that F^ is increasing proves that the equation Fa(A) = Cyar admits at most one 
solution. 



C.4 Study of Fb 

Recall that 



^b{X) 



All/9. 



A 1 



|y-mili 



(C.IOO) 



We will use repeatedly that is unique for all A > and that the trajectory A 1-^ is continuous 
under the Generic Condition, see [15]. 

Lemma C.3: Under the Generic Position Assumption of [15], the function Tb defined by (C.IOO) 
almost surely satisfies 



lim Fb(A) = +00. 
A4.0 



(C.lOl) 



Moreover, almost surely, there exists r > such that F^ is decreasing on the interval (0, r] with 
Fb(t) = 0, while \\y — X^xlU is increasing on (0, r]. 

Proof: Let us first show that limA^o ^b{X) = +00. 
Let Ao > be sufficiently small so that for all A < Aq, |Ta| = n. Such a Aq exists due to Step l.a. 
Hence, since Xf^ is nonsingular: 

Pvr, = Id„. (C.102) 
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Thus, using (5.71), we obtain 

y-xA = -XXf^{X'^XfysgTi{pf^, (C.103) 

which implies that 

\\y-XMl = A'll(^f,^fJ"'sgn(%)||^. 
Moreover, Lemma 5.1 combined with (5.71) gives 

ll^.lli > inf(5,5)eE \\{XlXs)-\Xly - \5)\l - m' > 0. 

Hence, for A < Aq, 

Am' 



X'\\Xf^{XiX^J-hgn (%) 
Using the trivial fact that sup(5^)gs Il^s(^5^s)~^<^ll2 < the proof of Step 1 is complete. 



Let us now show that is decreasing on (0,r) by studying the function 

We immediately deduce from Step 2 and the definition of the intervals Ik, k e IC, that $ is 
differentiable on each Ik, k e IC. Using (5.71), its derivative on 1^ reads 

^(A) = ||%||i-Asgn(^^J*(XlX^J-isgn(3fJ 

Now, is non singular, 

\\y - X^,\\l = Al(Xl X^J-^sgn (^^.J > X^n a^^ ([X^X^y)' > 

for A > 0. Therefore Lb (A) < +00 on (0,+oo), is continuous on Ik and differentiable on Ik. 
Moreover, using (C.98), we have 

dVs ^ f(A)||y-X^A||i-^(A)^^^(A) ^ f(A)-2if 
^A \\y-xA\\i \\y-xA\\l' 

Hence, using (C.105) and (C.98), 

dr, ^ -||gfJ|i-A||(XlX^J-V^sgn(gj.J Hi 
^A^^ \\y-xp,\\l 



< — , ' < 



A^II(^|.,^fJ-^«gn (%) Hi ^ V ((^f.^Tj-^ 



on each Ik. We can thus conclude, due to the non-singularity of Xf^, that Fb is decreasing on 
(0, r), as announced. 

□ 
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